library(ggplot2)   # library() attaches one package per call
library(gridExtra)
SID <- 2135198
SIDoffset <- (SID %% 25) + 1
load("CS5801_football_analysis.Rda")
mydf <- football.analysis[seq(from=SIDoffset,to=nrow(football.analysis),by=25),]
View(mydf)
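As a quick worked check of the sampling step above, the offset arithmetic can be reproduced on its own. This is only a sketch: n_rows is a hypothetical stand-in for nrow(football.analysis), whose value is not shown in this report.

```r
# Worked check of the sampling offset, using the SID from the code above.
SID <- 2135198
SIDoffset <- (SID %% 25) + 1   # 2135198 = 25 * 85407 + 23, so the offset is 24

# n_rows is a hypothetical stand-in for nrow(football.analysis).
n_rows <- 100
picked <- seq(from = SIDoffset, to = n_rows, by = 25)

print(SIDoffset)  # 24
print(picked)     # 24 49 74 99
```

The first picked positions agree with the row names (24, 49, 74, 99, ...) that appear in the is.na() output.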
I will go through my subset of the data systematically, checking for errors. I first inspect the data with View() and check for missing values with is.na(). I then use summary() to look for numerical errors, such as values that fall outside their valid range. Finally, I use str() to check that R has assigned an appropriate type to every variable.
First I look at the data using View() and then identify any NA values using is.na():
View(mydf) # to see the whole data frame
is.na(mydf) # checks for any missing values that may have occurred under any variable
sofifa_id potential wage_eur age height_cm weight_kg club_name
24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
49 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
74 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
99 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
124 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
149 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
174 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
199 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
224 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
249 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
274 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
299 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
324 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
349 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
374 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
399 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
424 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
449 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
474 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
499 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
524 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
549 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
574 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
599 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
624 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
649 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
674 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
699 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
724 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
749 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
774 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
799 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
824 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
849 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
874 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
899 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
924 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
949 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
974 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
999 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1024 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1049 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1074 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1099 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1124 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1149 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1174 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1199 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1224 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1249 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1274 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1299 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1324 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1349 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1374 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1399 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1424 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1449 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
preferred_foot pace shooting passing dribbling defending physic
24 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
49 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
74 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
99 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
124 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
149 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
174 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
199 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
224 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
249 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
274 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
299 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
324 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
349 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
374 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
399 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
424 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
449 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
474 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
499 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
524 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
549 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
574 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
599 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
624 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
649 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
674 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
699 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
724 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
749 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
774 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
799 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
824 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
849 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
874 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
899 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
924 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
949 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
974 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
999 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1024 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1049 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1074 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1099 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1124 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1149 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1174 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1199 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1224 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1249 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1274 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1299 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1324 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1349 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1374 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1399 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1424 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
1449 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
power_strength power_long_shots high.wage.ind
24 FALSE FALSE FALSE
49 FALSE FALSE FALSE
74 FALSE FALSE FALSE
99 FALSE FALSE FALSE
124 FALSE FALSE FALSE
149 FALSE FALSE FALSE
174 FALSE FALSE FALSE
199 FALSE FALSE FALSE
224 FALSE FALSE FALSE
249 FALSE FALSE FALSE
274 FALSE FALSE FALSE
299 FALSE FALSE FALSE
324 FALSE FALSE FALSE
349 FALSE FALSE FALSE
374 FALSE FALSE FALSE
399 FALSE FALSE FALSE
424 FALSE FALSE FALSE
449 FALSE FALSE FALSE
474 FALSE FALSE FALSE
499 FALSE FALSE FALSE
524 FALSE FALSE FALSE
549 FALSE FALSE FALSE
574 FALSE FALSE FALSE
599 FALSE FALSE FALSE
624 FALSE FALSE FALSE
649 FALSE FALSE FALSE
674 FALSE FALSE FALSE
699 FALSE FALSE FALSE
724 FALSE FALSE FALSE
749 FALSE FALSE FALSE
774 FALSE FALSE FALSE
799 FALSE FALSE FALSE
824 FALSE FALSE FALSE
849 FALSE FALSE FALSE
874 FALSE FALSE FALSE
899 FALSE FALSE FALSE
924 FALSE FALSE FALSE
949 FALSE FALSE FALSE
974 FALSE FALSE FALSE
999 FALSE FALSE FALSE
1024 FALSE FALSE FALSE
1049 FALSE FALSE FALSE
1074 FALSE FALSE FALSE
1099 FALSE FALSE FALSE
1124 FALSE FALSE FALSE
1149 FALSE FALSE FALSE
1174 FALSE FALSE FALSE
1199 FALSE FALSE FALSE
1224 FALSE FALSE FALSE
1249 FALSE FALSE FALSE
1274 FALSE FALSE FALSE
1299 FALSE FALSE FALSE
1324 FALSE FALSE FALSE
1349 FALSE FALSE FALSE
1374 FALSE FALSE FALSE
1399 FALSE FALSE FALSE
1424 FALSE FALSE FALSE
1449 FALSE FALSE FALSE
[ reached getOption("max.print") -- omitted 456 rows ]
At first glance, View() shows no glaring errors, and the printed part of the is.na() table shows no missing values (the print is truncated, but summary() would also report NA counts for the numeric columns, and none appear below). I then check the summary of my data for obvious errors:
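Because is.na() prints one row per observation and is cut off by max.print, a per-column count may be easier to read. The following is a minimal sketch on an invented toy data frame, not on mydf itself.

```r
# Toy data frame with two deliberately missing values (invented for illustration).
toy <- data.frame(age  = c(21, NA, 30),
                  club = c("A", "B", NA),
                  pace = c(70, 65, 80))

# TRUE if there is at least one NA anywhere in the data frame.
has_na <- anyNA(toy)

# Number of missing values in each column.
na_counts <- colSums(is.na(toy))

print(has_na)
print(na_counts)
```

Applied to mydf, colSums(is.na(mydf)) would give one count per variable instead of a 514-row table.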
summary(mydf)
sofifa_id potential wage_eur age
Min. :122849 Min. :47.0 Min. : 2.1 Min. :17.00
1st Qu.:211438 1st Qu.:67.0 1st Qu.: 1000.0 1st Qu.:22.00
Median :232724 Median :71.0 Median : 3000.0 Median :25.00
Mean :227623 Mean :71.2 Mean : 10785.7 Mean :25.56
3rd Qu.:247191 3rd Qu.:75.0 3rd Qu.: 10000.0 3rd Qu.:29.00
Max. :258946 Max. :90.0 Max. :260000.0 Max. :79.00
height_cm weight_kg club_name preferred_foot
Min. :159.0 Min. :33.00 Length:514 Length:514
1st Qu.:175.0 1st Qu.:70.00 Class :character Class :character
Median :180.0 Median :74.00 Mode :character Mode :character
Mean :180.2 Mean :74.24
3rd Qu.:185.0 3rd Qu.:79.00
Max. :226.0 Max. :94.00
pace shooting passing dribbling
Min. :-82.0 Min. :18.00 Min. :31.00 Min. :-62.00
1st Qu.: 60.0 1st Qu.:42.00 1st Qu.:51.00 1st Qu.: 56.00
Median : 67.0 Median :55.00 Median :58.00 Median : 64.00
Mean : 66.9 Mean :53.22 Mean :57.96 Mean : 62.68
3rd Qu.: 74.0 3rd Qu.:64.00 3rd Qu.:66.00 3rd Qu.: 70.00
Max. : 95.0 Max. :86.00 Max. :89.00 Max. : 88.00
defending physic power_strength power_long_shots
Min. :17.00 Min. :35.00 Min. :24.00 Min. :15.0
1st Qu.:37.00 1st Qu.:58.00 1st Qu.:58.00 1st Qu.:40.0
Median :55.00 Median :66.00 Median :67.00 Median :55.0
Mean :51.32 Mean :64.72 Mean :65.99 Mean :52.3
3rd Qu.:64.00 3rd Qu.:71.00 3rd Qu.:75.00 3rd Qu.:65.0
Max. :86.00 Max. :87.00 Max. :93.00 Max. :92.0
high.wage.ind
Min. :0.0000
1st Qu.:0.0000
Median :0.0000
Mean :0.2821
3rd Qu.:1.0000
Max. :1.0000
The summary shows quite a few questionable values: a minimum wage of 2.1 euros, a maximum age of 79, a height of 226 cm (about 7 ft 5 in), a minimum pace of -82 and a minimum dribbling of -62. The last two are impossible, as these ratings should lie between 0 and 100.
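The 0 to 100 range check can also be done programmatically rather than by reading the summary. This is a sketch on invented toy data that reuses three of the column names; rating_cols is a name introduced here for illustration.

```r
# Toy ratings with two impossible negative values (invented for illustration).
toy <- data.frame(pace      = c(73, -82, 80),
                  dribbling = c(88, 64, -62),
                  shooting  = c(76, 54, 66))

rating_cols <- c("pace", "dribbling", "shooting")

# Logical matrix: TRUE where a rating falls outside 0..100.
outside <- toy[rating_cols] < 0 | toy[rating_cols] > 100

# Count of out-of-range values per column, and the rows that contain any.
out_of_range <- colSums(outside)
bad_rows <- unname(which(rowSums(outside) > 0))

print(out_of_range)
print(bad_rows)
```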
The final check will involve looking at the structure function:
str(mydf)
'data.frame': 514 obs. of 17 variables:
$ sofifa_id : int 177003 204963 210243 200458 169195 199845 225953 198710 226271 200260 ...
$ potential : int 87 86 87 85 83 83 88 82 88 81 ...
$ wage_eur : num 260000 230000 120000 110000 33000 75000 105000 105000 70000 23000 ...
$ age : int 34 28 26 26 32 32 22 28 24 28 ...
$ height_cm : int 172 173 175 178 186 192 178 180 189 182 ...
$ weight_kg : int 66 73 70 74 86 88 78 75 70 75 ...
$ club_name : chr "Real Madrid CF" "Real Madrid CF" "Leicester City" "Everton" ...
$ preferred_foot : chr "Right" "Right" "Right" "Left" ...
$ pace : int 73 80 83 78 71 62 87 53 63 80 ...
$ shooting : int 76 54 66 69 79 50 79 86 78 80 ...
$ passing : int 89 78 79 80 82 62 78 85 78 83 ...
$ dribbling : int 88 80 82 79 81 63 85 86 83 82 ...
$ defending : int 71 82 81 80 72 86 39 50 74 36 ...
$ physic : int 66 80 76 76 77 83 72 63 68 64 ...
$ power_strength : int 58 74 69 69 79 86 78 64 67 58 ...
$ power_long_shots: int 82 47 64 76 81 60 83 92 83 84 ...
$ high.wage.ind : int 1 1 1 1 1 1 1 1 1 1 ...
The variables preferred foot, high wage indicator and club name are stored as character or integer types, when they should be categorical (factor) variables.
Preferred foot and high wage indicator should be binary categorical variables. R has read them in as character and integer variables respectively, which can be fixed by converting them to factors. Club name should also be a categorical variable with many levels but is currently a character variable, so I convert it to a factor as well.
mydf$preferred_foot <- as.factor(mydf$preferred_foot) # transformation into factors
mydf$high.wage.ind <- as.factor(mydf$high.wage.ind)
mydf$club_name <- as.factor(mydf$club_name)
str(mydf)
'data.frame': 514 obs. of 17 variables:
$ sofifa_id : int 177003 204963 210243 200458 169195 199845 225953 198710 226271 200260 ...
$ potential : int 87 86 87 85 83 83 88 82 88 81 ...
$ wage_eur : num 260000 230000 120000 110000 33000 75000 105000 105000 70000 23000 ...
$ age : int 34 28 26 26 32 32 22 28 24 28 ...
$ height_cm : int 172 173 175 178 186 192 178 180 189 182 ...
$ weight_kg : int 66 73 70 74 86 88 78 75 70 75 ...
$ club_name : Factor w/ 332 levels "1. FC Köln","1. FSV Mainz 05",..: 252 252 199 115 42 195 303 115 217 142 ...
$ preferred_foot : Factor w/ 3 levels "Left","right",..: 3 3 3 1 3 1 3 1 1 1 ...
$ pace : int 73 80 83 78 71 62 87 53 63 80 ...
$ shooting : int 76 54 66 69 79 50 79 86 78 80 ...
$ passing : int 89 78 79 80 82 62 78 85 78 83 ...
$ dribbling : int 88 80 82 79 81 63 85 86 83 82 ...
$ defending : int 71 82 81 80 72 86 39 50 74 36 ...
$ physic : int 66 80 76 76 77 83 72 63 68 64 ...
$ power_strength : int 58 74 69 69 79 86 78 64 67 58 ...
$ power_long_shots: int 82 47 64 76 81 60 83 92 83 84 ...
$ high.wage.ind : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
Preferred foot shows 3 levels, which is suspicious: logically a player can only be left or right footed. To explore this further, I tabulate the responses for preferred foot:
table(mydf$preferred_foot) # table of responses for category preferred foot
Left right Right
149 1 364
The table shows an error caused by a lower-case spelling of ‘right’ in one entry. I fix this by merging the two levels:
levels(mydf$preferred_foot)[levels(mydf$preferred_foot)=="right"] <-"Right"
# renames the misspelled level, which merges it into the existing "Right" level
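The level fix above relies on the fact that assigning an existing label to a factor level merges the two levels. A small self-contained sketch on an invented factor shows the effect.

```r
# Invented factor with one misspelled entry.
foot <- factor(c("Right", "Left", "right", "Right"))
levels_before <- nlevels(foot)   # 3 levels: Left, right, Right

# Renaming "right" to the existing label "Right" merges the two levels.
levels(foot)[levels(foot) == "right"] <- "Right"

print(levels_before)
print(table(foot))
```

After the assignment the factor has two levels and the formerly misspelled entry is counted under Right.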
I do not have the domain knowledge to explain how these 5 anomalous values arose, so I consider deleting them. There is no realistic chance that a professional footballer earns 2.1 euros a week or plays at the age of 79, and negative pace and dribbling ratings are impossible, so I will delete the rows containing these 4 values. A footballer as tall as 226 cm (about 7 ft 5 in) is very unlikely but not impossible, since people of that height have existed, so I treat this value as a phenomenon of interest rather than as noise and keep it.
We now locate the 4 rows with these unwanted outliers:
which(mydf$age == 79) # returns row position of desired element
[1] 173
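which(mydf$wage_eur == 2.10001) # exact equality on a non-integer value is fragile; which.min(mydf$wage_eur) would locate the minimum without an exact match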
[1] 205
which(mydf$pace == -82 )
[1] 241
which(mydf$dribbling == -62)
[1] 303
Now I delete these 4 rows and assign the remaining data to a new data frame:
cleandf <- mydf[-c(173, 205, 241, 303), ] # drops the four rows by position
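An alternative to looking up row positions one at a time is to filter on the plausibility rules directly. The sketch below uses invented toy data, and the cut-offs (age above 50, wage below 100 euros) are illustrative assumptions of mine, not thresholds taken from this analysis.

```r
# Toy data: row 1 is plausible, rows 2 to 5 each break one rule (invented values).
toy <- data.frame(age       = c(34, 79, 26, 28, 24),
                  wage_eur  = c(260000, 5000, 2.1, 3000, 7000),
                  pace      = c(73, 80, 83, -82, 63),
                  dribbling = c(88, 80, 82, 79, -62))

# Hypothetical plausibility rules; the thresholds are assumptions.
implausible <- with(toy, age > 50 | wage_eur < 100 | pace < 0 | dribbling < 0)

# Keep only the rows that pass every rule.
clean_toy <- toy[!implausible, ]

print(which(implausible))
print(nrow(clean_toy))
```

This avoids hard-coded indices, which would silently point at the wrong rows if the subset changed.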
I will start by asking 3 key questions: what data type is each variable, what kind of variation occurs within each variable, and what kind of covariation occurs between variables? I will then view the raw data and try to identify trends or patterns, both between variables and among the values of a single variable. Next I will visualise the data using histograms and box plots for numerical/continuous variables, and frequency tables and bar charts for categorical/binary variables. If necessary, I will transform variables to more appropriate scales to visualise them better. These visualisations will help to answer any questions raised by the raw data as well as the given research questions.
We now load the cleaned data frame:
cleandf
I will run a summary function on this set, to get some basic summary statistics:
summary(cleandf)
sofifa_id potential wage_eur age
Min. :122849 Min. :47.00 Min. : 500 Min. :17.00
1st Qu.:211438 1st Qu.:67.00 1st Qu.: 1000 1st Qu.:22.00
Median :232724 Median :71.00 Median : 3000 Median :25.00
Mean :227609 Mean :71.23 Mean : 10845 Mean :25.44
3rd Qu.:247191 3rd Qu.:75.00 3rd Qu.: 10000 3rd Qu.:29.00
Max. :258946 Max. :90.00 Max. :260000 Max. :38.00
height_cm weight_kg club_name preferred_foot
Min. :159.0 Min. :33.00 Cracovia : 5 Left :147
1st Qu.:175.0 1st Qu.:70.00 Aytemiz Alanyaspor : 4 Right:363
Median :180.0 Median :74.00 Fulham : 4
Mean :180.2 Mean :74.25 Odense Boldklub : 4
3rd Qu.:185.0 3rd Qu.:79.00 PGE FKS Stal Mielec: 4
Max. :226.0 Max. :94.00 Torino F.C. : 4
(Other) :485
pace shooting passing dribbling
Min. :32.00 Min. :18.00 Min. :31.00 Min. :30.00
1st Qu.:60.00 1st Qu.:42.00 1st Qu.:51.00 1st Qu.:56.00
Median :67.00 Median :55.00 Median :58.00 Median :64.00
Mean :67.14 Mean :53.18 Mean :57.94 Mean :62.86
3rd Qu.:74.00 3rd Qu.:64.00 3rd Qu.:66.00 3rd Qu.:70.00
Max. :95.00 Max. :86.00 Max. :89.00 Max. :88.00
defending physic power_strength power_long_shots
Min. :17.00 Min. :35.00 Min. :24.00 Min. :15.00
1st Qu.:37.00 1st Qu.:58.00 1st Qu.:58.00 1st Qu.:40.00
Median :55.00 Median :66.00 Median :67.00 Median :55.00
Mean :51.41 Mean :64.76 Mean :66.01 Mean :52.24
3rd Qu.:64.00 3rd Qu.:71.00 3rd Qu.:75.00 3rd Qu.:65.00
Max. :86.00 Max. :87.00 Max. :93.00 Max. :92.00
high.wage.ind
0:366
1:144
The cleaned data set no longer shows any obviously impossible values. The categorical variables preferred foot and high wage indicator now have 2 levels each.
The summary does not show the number of levels for club name, so I count them directly:
club_levels <- levels(cleandf$club_name) # character vector of the club name levels (named so it does not mask base::levels)
length(club_levels) # number of levels
[1] 332
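Club name has 332 levels as a categorical variable, i.e. up to 332 different professional clubs. Because the factor was created before the 4 rows were removed, a club that appeared only in a removed row would still be counted as a level.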
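The point about unused levels can be seen on a small invented factor: subsetting keeps every original level, and droplevels() removes the ones that no longer occur.

```r
# Invented club factor; "Everton" occurs exactly once.
club <- factor(c("Everton", "Fulham", "Torino F.C.", "Fulham"))

# Remove the only Everton observation.
kept <- club[club != "Everton"]

levels_kept    <- nlevels(kept)              # still counts the unused level
levels_dropped <- nlevels(droplevels(kept))  # counts only levels that occur

print(levels_kept)
print(levels_dropped)
```

Comparing nlevels(cleandf$club_name) with nlevels(droplevels(cleandf$club_name)) would show whether any club was lost with the removed rows.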
First I will produce some general plots: histograms for the continuous variables and bar charts for the categorical variables:
ggplot(cleandf,aes(x=potential))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("potential") # histogram with 20 bins, the classic theme and the variable name as the title
ggplot(cleandf,aes(x=wage_eur))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("wage_eur")
ggplot(cleandf,aes(x=age))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("age")
ggplot(cleandf,aes(x=height_cm))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("height_cm")
ggplot(cleandf,aes(x=weight_kg))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("weight_kg")
ggplot(cleandf,aes(x=pace))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("pace")
ggplot(cleandf,aes(x=shooting))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("shooting")
ggplot(cleandf,aes(x=passing))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("passing")
ggplot(cleandf,aes(x=dribbling))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("dribbling")
ggplot(cleandf,aes(x=defending))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("defending")
ggplot(cleandf,aes(x=physic))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("physic")
ggplot(cleandf,aes(x=power_strength))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("power_strength")
ggplot(cleandf,aes(x=power_long_shots))+geom_histogram(bins = 20) + theme_classic()+ ggtitle("power_long_shots")
ggplot(cleandf,aes(x=club_name))+geom_bar()+labs(title="club_name",x="club_name",y="frequency") # bar chart of counts per level; geom_bar() counts rows by default, so no y aesthetic is needed
ggplot(cleandf,aes(x=preferred_foot))+geom_bar()+labs(title="preferred_foot",x="preferred_foot",y="frequency")
ggplot(cleandf,aes(x=high.wage.ind))+geom_bar()+labs(title="high wage indicator",x="high.wage.ind",y="frequency")
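The thirteen histogram calls differ only in the variable name, so they could also be generated in a loop. This is a sketch on an invented toy data frame, assuming ggplot2 is installed; it uses the .data pronoun that ggplot2 provides to look up a column by its name.

```r
library(ggplot2)

# Toy stand-in for cleandf; the values are invented for illustration.
toy <- data.frame(potential = c(87, 86, 71, 65, 70),
                  age       = c(34, 28, 22, 19, 25),
                  pace      = c(73, 80, 67, 60, 74))

# Names of the numeric columns to plot.
num_vars <- names(toy)[sapply(toy, is.numeric)]

# Build one histogram per column instead of writing one call per variable.
plots <- lapply(num_vars, function(v) {
  ggplot(toy, aes(x = .data[[v]])) +
    geom_histogram(bins = 20) +
    theme_classic() +
    ggtitle(v)
})
names(plots) <- num_vars
```

A single plot can then be shown with print(plots$age), and gridExtra::grid.arrange(grobs = plots, ncol = 2) should place several on one page.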
The wage variable is strongly skewed to the right (positively skewed, with a long tail of high earners), but the rest of the continuous distributions look fairly symmetrical.
The bar charts for the categorical variables show that there are more right footed players than left footed ones (363 vs. 147) and that most players (366 of 510) do not earn above 8000 euros a week. The bar chart for club name is not helpful, as there are hundreds of clubs, which makes the plot unreadable.
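The skew in wage is also visible in the summary above, where the mean (10845) sits far above the median (3000). The sketch below shows the same check, and the effect of a log transformation, on an invented wage vector.

```r
# Invented weekly wages with a long right tail.
wage <- c(500, 1000, 1000, 3000, 3000, 10000, 33000, 260000)

# In a right-skewed variable the long tail pulls the mean above the median.
right_skewed <- mean(wage) > median(wage)

# A log10 transformation compresses the tail: the values now span under 3 units.
log_wage <- log10(wage)
log_span <- max(log_wage) - min(log_wage)

print(right_skewed)
print(round(log_span, 2))
```

In ggplot2 the same idea could be applied to the wage histogram by adding scale_x_log10() to the plot.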
Because the first research question concerns player potential, the remaining plots use potential as the dependent variable.
I will produce scatter plots of potential against the other numerical/continuous variables, and then box plots of potential against club name, high wage indicator and preferred foot:
ggplot(cleandf,aes(x=wage_eur,y=potential))+geom_point()+labs(title="potential vs. wage",x="wage_eur",y=" potential") + geom_smooth(method='lm', formula= y~x) # scatter plot of potential vs continuous variable with a best line of fit
ggplot(cleandf,aes(x=age,y=potential))+geom_point()+labs(title="potential vs. age",x="age",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=height_cm,y=potential))+geom_point()+labs(title="potential vs. height",x="height_cm",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=weight_kg,y=potential))+geom_point()+labs(title="potential vs. weight",x="weight_kg",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=pace,y=potential))+geom_point()+labs(title="potential vs. pace",x="pace",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=shooting,y=potential))+geom_point()+labs(title="potential vs. shooting",x="shooting",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=passing,y=potential))+geom_point()+labs(title="potential vs. passing",x="passing",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=dribbling,y=potential))+geom_point()+labs(title="potential vs. dribbling",x="dribbling",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=defending,y=potential))+geom_point()+labs(title="potential vs. defending",x="defending",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=physic,y=potential))+geom_point()+labs(title="potential vs. physic",x="physic",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=power_strength,y=potential))+geom_point()+labs(title="potential vs. power_strength",x="power_strength",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf,aes(x=power_long_shots,y=potential))+geom_point()+labs(title="potential vs. power_long_shots",x="power_long_shots",y=" potential") + geom_smooth(method='lm', formula= y~x)
ggplot(cleandf, aes(x=club_name, y=potential)) + geom_boxplot() + ggtitle("potential vs. club_name")+ theme_classic() #box plot with potential vs categorical variable
ggplot(cleandf, aes(x=high.wage.ind, y=potential)) + geom_boxplot() + ggtitle("potential vs. high.wage.ind")+ theme_classic()
ggplot(cleandf, aes(x=preferred_foot, y=potential)) + geom_boxplot() + ggtitle("potential vs. preferred_foot")+ theme_classic()
None of the scatter plots shows a strong linear relationship, although several show a weak linear trend.
The box plot for high wage indicator suggests that the median potential of footballers who do not earn above 8000 euros a week is lower than that of those who do.
The box plot for preferred foot suggests there is no difference in median potential between left footed and right footed players.
Below I detail some outliers from the plots that are relevant to players’ potential. The scatter plot of potential vs. wage suggests that the majority of players earn below 100,000 euros a week. In my subset only 6 players earned above this threshold, all with a potential above 80, and only 2 players earned above 200,000 euros a week.
The box plot of potential against high wage indicator shows 4 players earning below 8000 euros a week with a noticeably lower potential than others in the same wage group, and 2 players in that group with a noticeably higher potential. The box plot of potential against foot preference shows 3 left footed players with a noticeably higher potential than other left footed players. Among right footed players, 2 had a noticeably higher potential and 3 a noticeably lower potential than the rest.
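The outlier points described above could be listed rather than counted by eye. The sketch below uses base R's boxplot.stats(), which applies the 1.5 x IQR rule used by boxplot(); geom_boxplot() follows the same idea, although its quartiles may be computed slightly differently. The potential values are invented.

```r
# Invented potential ratings with one unusually high value.
potential <- c(60, 62, 64, 66, 68, 70, 72, 95)

# boxplot.stats() returns the five-number summary and the points
# lying more than 1.5 * IQR beyond the hinges.
stats <- boxplot.stats(potential)
outliers <- stats$out

print(stats$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(outliers)     # 95
```

Here the hinges are 63 and 71, so the upper fence is 71 + 1.5 * 8 = 83 and only the value 95 is flagged.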
When the data was originally read into R, every variable was stored as a numeric (integer or double) or character type. I converted club name, preferred foot and high wage indicator to categorical (factor) variables. There was also a labelling error in one level of preferred foot; correcting it reduced the variable from three levels to two, making it binary. The research question asks for a model of player potential, which is a continuous response variable, and the explanatory variables are a mix of continuous and binary/categorical variables. This allows me to use an ANOVA (analysis of variance) or ANCOVA (analysis of covariance) model, depending on which explanatory variables I pick. The histogram for player wage showed a positive skew, and club name has too many levels (which would add too many estimated parameters to the model), so I will not use these two variables. Apart from these exclusions, the starting model regresses player potential on every other explanatory variable, which makes it an ANCOVA model.
model<-lm(cleandf$potential~cleandf$age+cleandf$height_cm+cleandf$weight_kg+cleandf$preferred_foot+cleandf$pace+cleandf$shooting+cleandf$passing+cleandf$dribbling+cleandf$defending+cleandf$physic+cleandf$power_strength+cleandf$power_long_shots+cleandf$high.wage.ind) # ancova model uses lm function same as linear regression
summary(model) # produces general summary i.e the p value, r squared and f statistic
Call:
lm(formula = cleandf$potential ~ cleandf$age + cleandf$height_cm +
cleandf$weight_kg + cleandf$preferred_foot + cleandf$pace +
cleandf$shooting + cleandf$passing + cleandf$dribbling +
cleandf$defending + cleandf$physic + cleandf$power_strength +
cleandf$power_long_shots + cleandf$high.wage.ind)
Residuals:
Min 1Q Median 3Q Max
-14.7001 -2.3859 -0.0478 2.2922 10.3099
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 45.10397 6.15827 7.324 9.80e-13 ***
cleandf$age -0.78190 0.04380 -17.850 < 2e-16 ***
cleandf$height_cm 0.04206 0.03488 1.206 0.228547
cleandf$weight_kg 0.08373 0.03617 2.315 0.021036 *
cleandf$preferred_footRight 0.05804 0.36535 0.159 0.873831
cleandf$pace 0.01350 0.02040 0.662 0.508407
cleandf$shooting 0.16706 0.04054 4.121 4.43e-05 ***
cleandf$passing 0.01400 0.03537 0.396 0.692360
cleandf$dribbling 0.25544 0.03918 6.520 1.74e-10 ***
cleandf$defending 0.10258 0.01975 5.195 3.00e-07 ***
cleandf$physic 0.09636 0.05157 1.869 0.062281 .
cleandf$power_strength -0.02826 0.03821 -0.740 0.459920
cleandf$power_long_shots -0.10280 0.02968 -3.464 0.000579 ***
cleandf$high.wage.ind1 4.37147 0.45383 9.632 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.644 on 496 degrees of freedom
Multiple R-squared: 0.6676, Adjusted R-squared: 0.6589
F-statistic: 76.63 on 13 and 496 DF, p-value: < 2.2e-16
The F statistic is significant and the adjusted R-squared is fairly high at 0.659 (multiple R-squared 0.668), so this model looks like a fairly good fit. The coefficient estimates for age, weight, shooting, dribbling, defending, long shots and high wage indicator are significant at the 5% level. The p-value of each t statistic indicates how strong the evidence is that a coefficient differs from zero; the size of an effect is given by the estimate itself, in potential points per unit of the variable.
The variables whose estimated effects are not significant at the 5% level are height (0.042), right footed (0.058), pace (0.014), passing (0.014), physic (0.096) and strength (-0.028, the only negative estimate among them). The variables with a significant positive estimated effect on potential are weight (0.084), shooting (0.167), dribbling (0.255), defending (0.103) and the high wage indicator (4.371). The variables with a significant negative estimated effect are age (-0.782) and long shots (-0.103).
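Two points from the summary can be illustrated with a short sketch. First, passing the data frame through the data argument gives cleaner coefficient names than writing cleandf$ in every term, and it lets predict() work with new data. Second, a slope is read as a change in potential per unit of the variable. The toy data and the names toy and high_wage are invented for illustration; only the age estimate of -0.78190 is taken from the output above.

```r
set.seed(1)

# Invented data: potential falls with age and is higher for high earners.
toy <- data.frame(age       = sample(17:38, 60, replace = TRUE),
                  high_wage = factor(sample(c(0, 1), 60, replace = TRUE)))
toy$potential <- 90 - 0.8 * toy$age + 4 * (toy$high_wage == "1") + rnorm(60, sd = 3)

# The formula names columns only; the data frame is supplied via 'data'.
fit <- lm(potential ~ age + high_wage, data = toy)
coef_names <- rownames(coef(summary(fit)))
print(coef_names)   # "(Intercept)" "age" "high_wage1"

# Reading the age slope reported in the summary above:
age_coef <- -0.78190
five_year_gap <- 5 * age_coef   # predicted difference for a player 5 years older
print(five_year_gap)            # -3.9095
```

So, holding the other variables fixed, the fitted model predicts a potential about 3.9 points lower for a player who is 5 years older.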
Now we look at some plots of the model:
plot(model) # gives plots such as the Q-Q plot and residuals vs fitted
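plot() on an lm object produces four diagnostic plots by default, and its which argument selects a subset. The following is a sketch on invented data, with a Shapiro-Wilk test added as one possible numerical companion to the Q-Q plot; the test is my addition and is not part of the original analysis.

```r
set.seed(2)

# Invented data from a simple linear relationship with normal noise.
x <- runif(50, 0, 10)
y <- 2 + 3 * x + rnorm(50)
fit <- lm(y ~ x)

# Show only plot 1 (residuals vs fitted) and plot 2 (normal Q-Q) side by side.
op <- par(mfrow = c(1, 2))
plot(fit, which = 1:2)
par(op)

# Shapiro-Wilk test of the residuals; a small p-value would suggest non-normality.
sw <- shapiro.test(residuals(fit))
print(sw$p.value)
```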
We mostly care about the first two plots. The fairly flat line in the residuals vs fitted plot suggests there is no systematic pattern left in the residuals and, together with an even spread of points around it, is consistent with constant variance. The points in the Q-Q plot follow a mostly straight diagonal line with a slight S shape, which suggests the residuals are approximately normally distributed. Hence the plots do not raise any concern about the fitted ANCOVA model. Since some variables in this model were not significant, I believe I can improve on it by using the step function, which removes terms one at a time for as long as doing so lowers the AIC, leaving a smaller model for a player's potential.
model2 <- step(model) # backward elimination: drops one term at a time while the AIC decreases
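For a linear model, step() ranks candidate models by the quantity n * log(RSS / n) + 2 * edf, which is AIC up to an additive constant. As a worked check, the starting value can be reproduced from numbers printed elsewhere in this report: 496 residual degrees of freedom plus 14 coefficients gives n = 510, and the full model's RSS is 6586.9.

```r
# Values copied from the model summary and the step() output in this report.
n   <- 510      # 496 residual df + 14 estimated coefficients
rss <- 6586.9   # RSS on the <none> row of the first step table
edf <- 14       # intercept + 13 slopes

# AIC as used by step() for lm fits (up to an additive constant).
aic_step <- n * log(rss / n) + 2 * edf
print(round(aic_step, 1))   # 1332.8
```

This matches the "Start: AIC=1332.8" line, and it explains why dropping a term with a tiny sum of squares lowers the AIC: RSS barely rises while the penalty falls by 2.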
Start: AIC=1332.8
cleandf$potential ~ cleandf$age + cleandf$height_cm + cleandf$weight_kg +
cleandf$preferred_foot + cleandf$pace + cleandf$shooting +
cleandf$passing + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_strength + cleandf$power_long_shots +
cleandf$high.wage.ind
Df Sum of Sq RSS AIC
- cleandf$preferred_foot 1 0.3 6587.3 1330.8
- cleandf$passing 1 2.1 6589.0 1331.0
- cleandf$pace 1 5.8 6592.7 1331.2
- cleandf$power_strength 1 7.3 6594.2 1331.4
- cleandf$height_cm 1 19.3 6606.2 1332.3
<none> 6586.9 1332.8
- cleandf$physic 1 46.4 6633.3 1334.4
- cleandf$weight_kg 1 71.2 6658.1 1336.3
- cleandf$power_long_shots 1 159.3 6746.2 1343.0
- cleandf$shooting 1 225.5 6812.4 1348.0
- cleandf$defending 1 358.4 6945.3 1357.8
- cleandf$dribbling 1 564.5 7151.4 1372.7
- cleandf$high.wage.ind 1 1232.2 7819.1 1418.3
- cleandf$age 1 4231.6 10818.5 1583.8
Step: AIC=1330.83
cleandf$potential ~ cleandf$age + cleandf$height_cm + cleandf$weight_kg +
cleandf$pace + cleandf$shooting + cleandf$passing + cleandf$dribbling +
cleandf$defending + cleandf$physic + cleandf$power_strength +
cleandf$power_long_shots + cleandf$high.wage.ind
Df Sum of Sq RSS AIC
- cleandf$passing 1 2.0 6589.3 1329.0
- cleandf$pace 1 5.8 6593.0 1329.3
- cleandf$power_strength 1 7.6 6594.8 1329.4
- cleandf$height_cm 1 19.5 6606.7 1330.3
<none> 6587.3 1330.8
- cleandf$physic 1 48.2 6635.4 1332.5
- cleandf$weight_kg 1 71.4 6658.7 1334.3
- cleandf$power_long_shots 1 159.2 6746.5 1341.0
- cleandf$shooting 1 225.3 6812.6 1346.0
- cleandf$defending 1 359.2 6946.5 1355.9
- cleandf$dribbling 1 564.3 7151.6 1370.7
- cleandf$high.wage.ind 1 1231.9 7819.1 1416.3
- cleandf$age 1 4232.9 10820.2 1581.9
Step: AIC=1328.98
cleandf$potential ~ cleandf$age + cleandf$height_cm + cleandf$weight_kg +
cleandf$pace + cleandf$shooting + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_strength + cleandf$power_long_shots +
cleandf$high.wage.ind
Df Sum of Sq RSS AIC
- cleandf$pace 1 5.4 6594.6 1327.4
- cleandf$power_strength 1 7.2 6596.4 1327.5
- cleandf$height_cm 1 18.7 6608.0 1328.4
<none> 6589.3 1329.0
- cleandf$physic 1 46.5 6635.8 1330.6
- cleandf$weight_kg 1 70.9 6660.2 1332.4
- cleandf$power_long_shots 1 159.8 6749.1 1339.2
- cleandf$shooting 1 226.1 6815.3 1344.2
- cleandf$defending 1 458.2 7047.5 1361.3
- cleandf$dribbling 1 897.0 7486.3 1392.1
- cleandf$high.wage.ind 1 1242.2 7831.4 1415.1
- cleandf$age 1 4251.5 10840.7 1580.9
Step: AIC=1327.4
cleandf$potential ~ cleandf$age + cleandf$height_cm + cleandf$weight_kg +
cleandf$shooting + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_strength + cleandf$power_long_shots +
cleandf$high.wage.ind
Df Sum of Sq RSS AIC
- cleandf$power_strength 1 10.2 6604.8 1326.2
- cleandf$height_cm 1 17.0 6611.6 1326.7
<none> 6594.6 1327.4
- cleandf$physic 1 56.5 6651.1 1329.8
- cleandf$weight_kg 1 70.2 6664.8 1330.8
- cleandf$power_long_shots 1 162.2 6756.8 1337.8
- cleandf$shooting 1 222.0 6816.6 1342.3
- cleandf$defending 1 462.2 7056.8 1359.9
- cleandf$dribbling 1 1146.1 7740.8 1407.1
- cleandf$high.wage.ind 1 1242.6 7837.3 1413.4
- cleandf$age 1 4477.5 11072.2 1589.7
Step: AIC=1326.18
cleandf$potential ~ cleandf$age + cleandf$height_cm + cleandf$weight_kg +
cleandf$shooting + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_long_shots + cleandf$high.wage.ind
Df Sum of Sq RSS AIC
- cleandf$height_cm 1 13.3 6618.1 1325.2
<none> 6604.8 1326.2
- cleandf$weight_kg 1 60.2 6665.0 1328.8
- cleandf$physic 1 75.1 6679.9 1330.0
- cleandf$power_long_shots 1 161.8 6766.6 1336.5
- cleandf$shooting 1 222.9 6827.7 1341.1
- cleandf$defending 1 560.9 7165.7 1365.8
- cleandf$dribbling 1 1195.3 7800.1 1409.0
- cleandf$high.wage.ind 1 1274.7 7879.5 1414.2
- cleandf$age 1 4602.4 11207.2 1593.8
Step: AIC=1325.21
cleandf$potential ~ cleandf$age + cleandf$weight_kg + cleandf$shooting +
cleandf$dribbling + cleandf$defending + cleandf$physic +
cleandf$power_long_shots + cleandf$high.wage.ind
Df Sum of Sq RSS AIC
<none> 6618.1 1325.2
- cleandf$physic 1 91.1 6709.2 1330.2
- cleandf$weight_kg 1 112.5 6730.6 1331.8
- cleandf$power_long_shots 1 179.6 6797.7 1336.9
- cleandf$shooting 1 248.8 6866.9 1342.0
- cleandf$defending 1 570.0 7188.1 1365.3
- cleandf$dribbling 1 1207.8 7825.8 1408.7
- cleandf$high.wage.ind 1 1302.1 7920.2 1414.8
- cleandf$age 1 4774.3 11392.4 1600.2
The reduced model is:
lm(cleandf$potential ~ cleandf$age + cleandf$weight_kg + cleandf$shooting + cleandf$dribbling + cleandf$defending + cleandf$physic + cleandf$power_long_shots + cleandf$high.wage.ind)
Call:
lm(formula = cleandf$potential ~ cleandf$age + cleandf$weight_kg +
cleandf$shooting + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_long_shots + cleandf$high.wage.ind)
Coefficients:
             (Intercept)               cleandf$age         cleandf$weight_kg
                52.88791                  -0.79633                   0.08825
        cleandf$shooting         cleandf$dribbling         cleandf$defending
                 0.17238                   0.26865                   0.10814
          cleandf$physic  cleandf$power_long_shots    cleandf$high.wage.ind1
                 0.07175                  -0.10485                   4.45239
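As a check on the trace above, the AIC values that step() prints for a linear model can be reproduced by hand. The sketch below assumes the criterion n * log(RSS / n) + 2 * edf, which is what extractAIC() documents for lm fits, and takes n = 510 from the 501 residual degrees of freedom plus 9 coefficients reported in the model summary. The helper name step_aic is introduced only for this illustration.

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * edf,
# where edf counts the estimated coefficients (including the intercept).
step_aic <- function(rss, n, edf) n * log(rss / n) + 2 * edf

n <- 510  # 501 residual df + 9 coefficients in the reduced model

aic_full    <- step_aic(rss = 6586.9, n = n, edf = 14)  # starting model: 13 terms + intercept
aic_reduced <- step_aic(rss = 6618.1, n = n, edf = 9)   # final model: 8 terms + intercept

round(c(full = aic_full, reduced = aic_reduced), 1)
```

Both values agree, up to rounding of the printed RSS, with the 1332.8 at the start of the trace and the 1325.21 at the final step. This also shows why a term can be dropped even though the RSS rises slightly: the penalty of 2 per coefficient outweighs the small loss of fit.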
We now run a summary again on this new model to see if we have improved from the last model:
summary(model2) # produces general summary i.e the p value, r squared and f statistic
Call:
lm(formula = cleandf$potential ~ cleandf$age + cleandf$weight_kg +
cleandf$shooting + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_long_shots + cleandf$high.wage.ind)
Residuals:
     Min       1Q   Median       3Q      Max
-14.2921  -2.3252  -0.0316   2.3169  10.3527
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)              52.88791    2.53970  20.825  < 2e-16 ***
cleandf$age              -0.79633    0.04189 -19.011  < 2e-16 ***
cleandf$weight_kg         0.08825    0.03024   2.918 0.003676 **
cleandf$shooting          0.17238    0.03972   4.340 1.72e-05 ***
cleandf$dribbling         0.26865    0.02810   9.562  < 2e-16 ***
cleandf$defending         0.10814    0.01646   6.569 1.27e-10 ***
cleandf$physic            0.07175    0.02733   2.626 0.008911 **
cleandf$power_long_shots -0.10485    0.02843  -3.688 0.000251 ***
cleandf$high.wage.ind1    4.45239    0.44846   9.928  < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.635 on 501 degrees of freedom
Multiple R-squared: 0.666, Adjusted R-squared: 0.6607
F-statistic: 124.9 on 8 and 501 DF, p-value: < 2.2e-16
This time every term left in the model is statistically significant, with all p-values below 0.01. As before, we now plot the diagnostics for the new model:
plot(model2) # gives plots such as the Q-Q plot and residuals vs fitted
There is minimal difference in plots compared to the previous model.
I have decided to present the reduced model as the final model for player potential. Although it is not a major improvement over the original model, it retains only terms that are statistically significant, which reduces the complexity of the model overall. The F statistic is still significant and the adjusted R squared is slightly better than the previous model's at 0.661, so this model again looks like a fairly good fit. All of the variables (age, weight, shooting, dribbling, defending, physic, long shots and the high wage indicator) have coefficient estimates that are significant. The t value and its p-value indicate how strong the evidence is that a coefficient differs from zero; the size of each effect is given by the estimate itself.
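The variables estimated to have a positive effect on potential are weight (0.088), shooting (0.172), dribbling (0.269), defending (0.108), physic (0.072) and the high wage indicator (4.452). The variables estimated to have a negative effect on potential are age (-0.796) and long shots (-0.105). Each estimate is the expected change in potential for a one-unit increase in that variable with the other variables held fixed, so players earning above 8000 euros a week are estimated to have a potential about 4.5 points higher than otherwise similar players. Similar to last time, the fairly flat line in the residuals vs fitted plot and the fairly straight line in the Q-Q plot suggest that the variance is roughly constant and the residuals are approximately normally distributed, respectively. Once again the plots indicate that the multiple linear regression model fits reasonably well.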
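Because the explanatory variables are measured on different scales (years, kilograms, rating points, a 0/1 indicator), the raw estimates cannot be compared directly to judge which effect is strongest. One common option is to refit the model with standardized predictors. The sketch below uses simulated data rather than cleandf, and the column names and coefficients in it are assumptions made purely for illustration.

```r
set.seed(42)
n <- 300

# Simulated stand-in for two of the predictors (not the real data)
sim <- data.frame(
  age       = runif(n, 17, 38),
  dribbling = rnorm(n, 62, 10)
)
sim$potential <- 70 - 0.8 * sim$age + 0.27 * sim$dribbling + rnorm(n, sd = 3.5)

# Raw fit: coefficients are per year of age and per rating point
fit_raw <- lm(potential ~ age + dribbling, data = sim)

# Standardized fit: coefficients are per one standard deviation of each predictor
sim_std <- transform(sim,
                     age       = as.numeric(scale(age)),
                     dribbling = as.numeric(scale(dribbling)))
fit_std <- lm(potential ~ age + dribbling, data = sim_std)

coef(fit_raw)
coef(fit_std)
```

Each standardized slope equals the raw slope multiplied by the standard deviation of its predictor, so the two fits describe the same model; only the units change. Passing data = sim, instead of writing sim$ in front of every variable, also keeps the coefficient names short and lets predict() work on new data.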
Although the line is straight for most of the residuals vs fitted plot, it shows a slight curve at the beginning. In the residuals vs leverage plot, which shows the standardized residuals against the leverage of each observation, the line becomes substantially more curved and most of the points are clustered on the left side of the plot, because a few high-leverage observations sit far to the right. In both plots the points are also not spread evenly around the middle of the y axis. Another minor weakness is that R squared can be misleading. It is the proportion of variance in the response that is accounted for by the explanatory variables, and it never decreases when more explanatory variables are added, even if they are unrelated to the response. The adjusted R squared compensates for this by taking into account the number of explanatory variables, but it is still not a completely neutral indicator of the variance explained by the model.
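The adjusted R squared and the F statistic reported in the summary can be recovered from the multiple R squared, the sample size and the number of slope coefficients. The sketch below applies the standard formulas to the rounded values printed in the summary output (R squared 0.666, n = 510, p = 8), so it agrees with the printed figures only up to rounding.

```r
r2 <- 0.666   # multiple R-squared from summary(model2)
n  <- 510     # observations: 501 residual df + 9 coefficients
p  <- 8       # slope coefficients, excluding the intercept

# Adjusted R-squared penalizes each additional explanatory variable
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic on p and n - p - 1 degrees of freedom
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(adj_r2, 4)
round(f_stat, 1)
```

The penalty factor (n - 1) / (n - p - 1) is close to 1 here because n is large relative to p, which is why the adjusted value (0.6607) is only slightly below the unadjusted one.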
The slight curve at the left side of the residuals vs fitted plot suggests that some relationships are not linear, so polynomial terms, and possibly interactions between variables, could be added to the model. One way to check whether interactions are worth including is to fit a regression tree with the 'tree' package. The more pronounced curve in the residuals vs leverage plot, and the clustering of points towards the left side of that plot, could be addressed by removing the point that sits alone on the right side, provided it proves to be an error rather than a genuine observation, and again by transforming some of the explanatory variables. The uneven spread in the y axis could perhaps be addressed by transforming the response variable, player potential, or by experimenting with adding or removing a few of the explanatory variables to get a more equal spread.
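A minimal sketch of how a polynomial term or an interaction could be tested is given below. It uses simulated data with a deliberately curved age effect, so the variable names, the coefficients and the conclusion are illustrative assumptions, not results for cleandf.

```r
set.seed(1)
n <- 200
sim <- data.frame(age = runif(n, 17, 40), dribbling = rnorm(n, 62, 10))

# Simulated response with curvature in age (an assumption for illustration)
sim$potential <- 80 - 0.05 * (sim$age - 17)^2 + 0.2 * sim$dribbling + rnorm(n, sd = 2)

fit_lin  <- lm(potential ~ age + dribbling, data = sim)           # linear terms only
fit_quad <- lm(potential ~ poly(age, 2) + dribbling, data = sim)  # adds a quadratic age term
fit_int  <- lm(potential ~ poly(age, 2) * dribbling, data = sim)  # adds age-by-dribbling interactions

anova(fit_lin, fit_quad, fit_int)  # nested F tests, one row per added block of terms
AIC(fit_lin, fit_quad, fit_int)    # lower is better
```

On the real data the same comparison would show whether the extra terms earn their place; if neither the F test nor the AIC favours them, the simpler linear model is preferable.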
Given that I have provided individual plots for each variable in this data frame earlier, I will go straight to plots of high.wage.ind, the binary response variable, against every explanatory variable. For continuous explanatory variables I will produce a box plot against high.wage.ind, and for binary/categorical explanatory variables I will produce a mosaic plot against high.wage.ind.
ggplot(cleandf, aes(x=high.wage.ind, y=potential)) + geom_boxplot() + ggtitle("potential vs. high wage indicator")+ theme_classic() #box plot with continuous variable vs high wage indicator
ggplot(cleandf, aes(x=high.wage.ind, y=wage_eur)) + geom_boxplot() + ggtitle("wage_eur vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=age)) + geom_boxplot() + ggtitle("age vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=height_cm)) + geom_boxplot() + ggtitle("height_cm vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=weight_kg)) + geom_boxplot() + ggtitle("weight_kg vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=pace)) + geom_boxplot() + ggtitle("pace vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=shooting)) + geom_boxplot() + ggtitle("shooting vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=passing)) + geom_boxplot() + ggtitle("passing vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=dribbling)) + geom_boxplot() + ggtitle("dribbling vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=defending)) + geom_boxplot() + ggtitle("defending vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=physic)) + geom_boxplot() + ggtitle("physic vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=power_strength)) + geom_boxplot() + ggtitle("power_strength vs. high wage indicator")+ theme_classic()
ggplot(cleandf, aes(x=high.wage.ind, y=power_long_shots)) + geom_boxplot() + ggtitle("power_long_shots vs. high wage indicator")+ theme_classic()
mosaicplot(cleandf$high.wage.ind~cleandf$preferred_foot, main ="Mosaic Plot of high.wage.ind against preferred_foot",ylab="preferred_foot", xlab="high.wage.ind") #plot with categorical variable vs high wage indicator
mosaicplot(cleandf$high.wage.ind~cleandf$club_name, main ="Mosaic Plot of high.wage.ind against club_name",ylab="club_name", xlab="high.wage.ind")
The box plots suggest that footballers who do not earn above 8000 euros a week have a lower median than those who do for potential, weekly wage, pace, shooting, passing, dribbling, defending, physic, strength and long shots. For age, height and weight the median is only slightly lower for footballers who do not earn above 8000 euros a week.
The mosaic plot shows that right footed players make up the larger share of both groups: those who earn above 8000 euros a week and those who do not.
The mosaic plot for club name against whether or not a player earned above 8000 euros a week is undecipherable as there are too many levels for the club name variable.
The second research question asks for a model of high.wage.ind, a binary response variable, where the explanatory variables are a mix of continuous and binary/categorical variables. This allows me to use logistic regression, or a count data method such as the chi-squared test or Fisher's exact test, depending on which explanatory variables I pick. The mosaic plot for club name displayed too many levels, so I have decided not to use this variable. Therefore, I will begin modelling with all explanatory variables apart from club name, using a logistic regression model.
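For a single categorical explanatory variable such as preferred_foot, the count data route mentioned above amounts to a test of association on a two-way table. The sketch below uses simulated factors in place of the real columns, so the counts and p-values it produces are illustrative only.

```r
set.seed(7)
n <- 510

# Simulated stand-ins for the two factors (independent by construction)
sim <- data.frame(
  preferred_foot = factor(sample(c("Left", "Right"), n, replace = TRUE, prob = c(0.25, 0.75))),
  high.wage.ind  = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.7, 0.3)))
)

tab <- table(sim$preferred_foot, sim$high.wage.ind)
prop.table(tab, margin = 2)   # share of each foot within each wage group

chi  <- chisq.test(tab)       # large-sample test of independence
fish <- fisher.test(tab)      # exact alternative, useful when expected counts are small

c(chi_squared = chi$p.value, fisher = fish$p.value)
```

These tests handle one categorical explanatory variable at a time, which is why logistic regression is the more natural choice once continuous variables are included as well.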
logistic.glm<-glm(cleandf$high.wage.ind~cleandf$potential+cleandf$wage_eur+cleandf$age+cleandf$height_cm+cleandf$weight_kg+cleandf$preferred_foot+cleandf$pace+cleandf$shooting+cleandf$passing+cleandf$dribbling+cleandf$defending+cleandf$physic+cleandf$power_strength+cleandf$power_long_shots, family = "binomial") # logistic regression: binomial family (logit link) for the binary response
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
summary(logistic.glm) # produces general summary i.e the deviance value and AIC value
Call:
glm(formula = cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$preferred_foot +
cleandf$pace + cleandf$shooting + cleandf$passing + cleandf$dribbling +
cleandf$defending + cleandf$physic + cleandf$power_strength +
cleandf$power_long_shots, family = "binomial")
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-2.934e-04  -2.000e-08  -2.000e-08   2.000e-08   3.426e-04
Coefficients:
                              Estimate Std. Error z value Pr(>|z|)
(Intercept)                 -3.121e+02  1.223e+05  -0.003    0.998
cleandf$potential            1.851e-01  5.825e+02   0.000    1.000
cleandf$wage_eur             3.546e-02  2.006e+00   0.018    0.986
cleandf$age                  1.482e-01  5.517e+02   0.000    1.000
cleandf$height_cm           -7.777e-02  5.154e+02   0.000    1.000
cleandf$weight_kg            4.417e-02  5.089e+02   0.000    1.000
cleandf$preferred_footRight  3.536e-01  3.527e+03   0.000    1.000
cleandf$pace                -1.977e-03  1.576e+02   0.000    1.000
cleandf$shooting             2.297e-01  6.133e+02   0.000    1.000
cleandf$passing             -8.560e-02  4.430e+02   0.000    1.000
cleandf$dribbling            2.683e-02  6.781e+02   0.000    1.000
cleandf$defending           -1.050e-02  2.200e+02   0.000    1.000
cleandf$physic               2.416e-01  5.650e+02   0.000    1.000
cleandf$power_strength      -1.129e-01  4.432e+02   0.000    1.000
cleandf$power_long_shots    -2.378e-01  5.830e+02   0.000    1.000
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.0707e+02 on 509 degrees of freedom
Residual deviance: 7.5300e-07 on 495 degrees of freedom
AIC: 30
Number of Fisher Scoring iterations: 25
exp(coef(logistic.glm)) # exponentiates the coefficients to give the odds ratio for a one-unit increase in each variable
                (Intercept)           cleandf$potential
              2.825396e-136                1.203381e+00
           cleandf$wage_eur                 cleandf$age
               1.036097e+00                1.159704e+00
          cleandf$height_cm           cleandf$weight_kg
               9.251803e-01                1.045157e+00
cleandf$preferred_footRight                cleandf$pace
               1.424210e+00                9.980247e-01
           cleandf$shooting             cleandf$passing
               1.258242e+00                9.179580e-01
          cleandf$dribbling           cleandf$defending
               1.027188e+00                9.895564e-01
             cleandf$physic      cleandf$power_strength
               1.273340e+00                8.932411e-01
   cleandf$power_long_shots
               7.883343e-01
An initial look shows that this model is not a good one. None of the z values are significant, the standard errors are enormous relative to the estimates, and the warnings show that the fitting algorithm did not converge and that fitted probabilities of numerically 0 or 1 occurred. The residual deviance is almost zero, but this does not mean the model explains the data well in any useful sense: it is a symptom of complete separation. The most likely cause is wage_eur, because high.wage.ind records whether a player earns above 8000 euros a week, so the wage variable on its own predicts the response perfectly. In this situation the coefficient estimates, and the odds ratios computed from them, are not reliable and should not be interpreted.
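The warnings and the near-zero residual deviance are what one would expect when one explanatory variable determines the response exactly. The sketch below reproduces that behaviour on simulated data in which the indicator is defined directly from the wage. All names and numbers in it are assumptions for illustration, and the second fit simply drops the wage variable to show a model that can be estimated.

```r
set.seed(123)
n <- 300
sim <- data.frame(skill = rnorm(n, 65, 8))

# Simulated weekly wage, rounded to the nearest 1000 euros
sim$wage <- round(exp(7 + 0.03 * sim$skill + rnorm(n, sd = 0.6)) / 1000) * 1000
sim$high <- as.integer(sim$wage > 8000)   # indicator defined from the wage itself

# Fit with the wage as a predictor, collecting any warnings glm() raises
sep_warnings <- character(0)
fit_sep <- withCallingHandlers(
  glm(high ~ wage + skill, family = binomial, data = sim),
  warning = function(w) {
    sep_warnings <<- c(sep_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)
unique(sep_warnings)
c(null = fit_sep$null.deviance, residual = deviance(fit_sep))

# Dropping the variable that defines the response gives an estimable model
fit_ok <- glm(high ~ skill, family = binomial, data = sim)
summary(fit_ok)$coefficients
```

In the separated fit the residual deviance collapses towards zero while the standard errors blow up, which mirrors the summary above. A natural next step for the real data would be to refit the logistic model without wage_eur, since the question of interest is which other characteristics are associated with a high wage.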
I will now try to reduce the model to get a better fit by using the step function, and then repeat the earlier steps of summarising the new model and exponentiating its coefficients to obtain the odds ratios.
steplog.glm <- step(logistic.glm) # backward stepwise selection: drops terms one at a time while doing so lowers the AIC
Start: AIC=30
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$preferred_foot +
cleandf$pace + cleandf$shooting + cleandf$passing + cleandf$dribbling +
cleandf$defending + cleandf$physic + cleandf$power_strength +
cleandf$power_long_shots
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- cleandf$passing 1 0.00 28.00
- cleandf$pace 1 0.00 28.00
- cleandf$preferred_foot 1 0.00 28.00
- cleandf$dribbling 1 0.00 28.00
- cleandf$physic 1 0.00 28.00
- cleandf$power_long_shots 1 0.00 28.00
- cleandf$shooting 1 0.00 28.00
- cleandf$defending 1 0.00 28.00
- cleandf$power_strength 1 0.00 28.00
- cleandf$height_cm 1 0.00 28.00
- cleandf$weight_kg 1 0.00 28.00
- cleandf$age 1 0.00 28.00
- cleandf$potential 1 0.00 28.00
<none> 0.00 30.00
- cleandf$wage_eur 1 277.25 305.25
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=28
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$preferred_foot +
cleandf$pace + cleandf$shooting + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_strength + cleandf$power_long_shots
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- cleandf$power_long_shots 1 0.00 26.00
- cleandf$physic 1 0.00 26.00
- cleandf$preferred_foot 1 0.00 26.00
- cleandf$pace 1 0.00 26.00
- cleandf$dribbling 1 0.00 26.00
- cleandf$shooting 1 0.00 26.00
- cleandf$defending 1 0.00 26.00
- cleandf$power_strength 1 0.00 26.00
- cleandf$height_cm 1 0.00 26.00
- cleandf$age 1 0.00 26.00
- cleandf$weight_kg 1 0.00 26.00
- cleandf$potential 1 0.00 26.00
<none> 0.00 28.00
- cleandf$wage_eur 1 277.25 303.25
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=26
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$preferred_foot +
cleandf$pace + cleandf$shooting + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_strength
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- cleandf$preferred_foot 1 0.00 24.00
- cleandf$physic 1 0.00 24.00
- cleandf$defending 1 0.00 24.00
- cleandf$pace 1 0.00 24.00
- cleandf$dribbling 1 0.00 24.00
- cleandf$shooting 1 0.00 24.00
- cleandf$height_cm 1 0.00 24.00
- cleandf$power_strength 1 0.00 24.00
- cleandf$weight_kg 1 0.00 24.00
- cleandf$age 1 0.00 24.00
- cleandf$potential 1 0.00 24.00
<none> 0.00 26.00
- cleandf$wage_eur 1 277.45 301.45
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=24
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$pace +
cleandf$shooting + cleandf$dribbling + cleandf$defending +
cleandf$physic + cleandf$power_strength
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- cleandf$defending 1 0.00 22.00
- cleandf$pace 1 0.00 22.00
- cleandf$physic 1 0.00 22.00
- cleandf$shooting 1 0.00 22.00
- cleandf$dribbling 1 0.00 22.00
- cleandf$height_cm 1 0.00 22.00
- cleandf$power_strength 1 0.00 22.00
- cleandf$weight_kg 1 0.00 22.00
- cleandf$age 1 0.00 22.00
- cleandf$potential 1 0.00 22.00
<none> 0.00 24.00
- cleandf$wage_eur 1 281.58 303.58
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=22
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$pace +
cleandf$shooting + cleandf$dribbling + cleandf$physic + cleandf$power_strength
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- cleandf$pace 1 0.00 20.00
- cleandf$dribbling 1 0.00 20.00
- cleandf$shooting 1 0.00 20.00
- cleandf$height_cm 1 0.00 20.00
- cleandf$physic 1 0.00 20.00
- cleandf$power_strength 1 0.00 20.00
- cleandf$weight_kg 1 0.00 20.00
- cleandf$age 1 0.00 20.00
- cleandf$potential 1 0.00 20.00
<none> 0.00 22.00
- cleandf$wage_eur 1 282.96 302.96
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=20
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$shooting +
cleandf$dribbling + cleandf$physic + cleandf$power_strength
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- cleandf$shooting 1 0.00 18.00
- cleandf$height_cm 1 0.00 18.00
- cleandf$physic 1 0.00 18.00
- cleandf$dribbling 1 0.00 18.00
- cleandf$power_strength 1 0.00 18.00
- cleandf$weight_kg 1 0.00 18.00
- cleandf$age 1 0.00 18.00
- cleandf$potential 1 0.00 18.00
<none> 0.00 20.00
- cleandf$wage_eur 1 282.96 300.96
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=18
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$dribbling +
cleandf$physic + cleandf$power_strength
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- cleandf$physic 1 0.00 16.00
- cleandf$height_cm 1 0.00 16.00
- cleandf$power_strength 1 0.00 16.00
- cleandf$weight_kg 1 0.00 16.00
- cleandf$dribbling 1 0.00 16.00
- cleandf$age 1 0.00 16.00
- cleandf$potential 1 0.00 16.00
<none> 0.00 18.00
- cleandf$wage_eur 1 283.31 299.31
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=16
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$dribbling +
cleandf$power_strength
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Df Deviance AIC
- cleandf$power_strength 1 0.00 14.00
- cleandf$height_cm 1 0.00 14.00
- cleandf$weight_kg 1 0.00 14.00
- cleandf$dribbling 1 0.00 14.00
- cleandf$age 1 0.00 14.00
- cleandf$potential 1 0.00 14.00
<none> 0.00 16.00
- cleandf$wage_eur 1 287.02 301.02
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=14
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$height_cm + cleandf$weight_kg + cleandf$dribbling
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
                    Df Deviance    AIC
- cleandf$height_cm  1     0.00  12.00
- cleandf$weight_kg  1     0.00  12.00
- cleandf$age        1     0.00  12.00
- cleandf$dribbling  1     0.00  12.00
- cleandf$potential  1     0.00  12.00
<none>                     0.00  14.00
- cleandf$wage_eur   1   287.07 299.07
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=12
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$weight_kg + cleandf$dribbling
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
                    Df Deviance    AIC
- cleandf$dribbling  1     0.00  10.00
- cleandf$age        1     0.00  10.00
- cleandf$weight_kg  1     0.00  10.00
- cleandf$potential  1     0.00  10.00
<none>                     0.00  12.00
- cleandf$wage_eur   1   288.21 298.21
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=10
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age + cleandf$weight_kg
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
                    Df Deviance AIC
- cleandf$weight_kg  1        0   8
- cleandf$age        1        0   8
- cleandf$potential  1        0   8
<none>                        0  10
- cleandf$wage_eur   1      296 304
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=8
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur +
cleandf$age
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
                    Df Deviance   AIC
- cleandf$age        1      0.0   6.0
- cleandf$potential  1      0.0   6.0
<none>                      0.0   8.0
- cleandf$wage_eur   1    299.7 305.7
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=6
cleandf$high.wage.ind ~ cleandf$potential + cleandf$wage_eur
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
                    Df Deviance   AIC
- cleandf$potential  1      0.0   4.0
<none>                      0.0   6.0
- cleandf$wage_eur   1    403.4 407.4
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Step: AIC=4
cleandf$high.wage.ind ~ cleandf$wage_eur
                   Df Deviance    AIC
<none>                    0.00   4.00
- cleandf$wage_eur  1   607.07 609.07
summary(steplog.glm) # general summary of the reduced model, including the deviance and AIC values
Call:
glm(formula = cleandf$high.wage.ind ~ cleandf$wage_eur, family = "binomial")
Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.941e-04  -2.100e-08  -2.100e-08   2.100e-08   1.803e-04
Coefficients:
                   Estimate Std. Error z value Pr(>|z|)
(Intercept)      -3.036e+02  1.527e+04   -0.02    0.984
cleandf$wage_eur  3.572e-02  1.801e+00    0.02    0.984
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 6.0707e+02 on 509 degrees of freedom
Residual deviance: 9.1235e-07 on 508 degrees of freedom
AIC: 4
Number of Fisher Scoring iterations: 25
exp(coef(steplog.glm)) # exponentiates the coefficients to give odds ratios for the response variable
(Intercept) cleandf$wage_eur
1.454187e-132 1.036368e+00
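As a rough check on these coefficients, the wage at which the fitted probability equals 0.5 can be recovered as minus the intercept divided by the slope. The sketch below uses the rounded estimates copied from the printed output rather than the fitted object, and the names b0, b1 and boundary are introduced here only for illustration, so the result is approximate.

```r
# Estimates copied (rounded) from the printed model output
b0 <- -303.56679   # intercept
b1 <- 0.03572      # slope for wage_eur, per euro of weekly wage

# Wage at which the fitted log-odds are zero (fitted probability 0.5)
boundary <- -b0 / b1
boundary

# Odds ratio per extra euro, and per extra 1000 euros, of weekly wage
exp(b1)
exp(1000 * b1)
```

The boundary comes out at roughly 8,500 euros. An odds ratio that looks close to 1 per euro becomes enormous per 1000 euros, so its apparent size depends entirely on the unit of the explanatory variable.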
The final reduced model is:
glm(formula = cleandf$high.wage.ind ~ cleandf$wage_eur, family = "binomial")
Warning: glm.fit: algorithm did not converge
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Call: glm(formula = cleandf$high.wage.ind ~ cleandf$wage_eur, family = "binomial")
Coefficients:
(Intercept) cleandf$wage_eur
-303.56679 0.03572
Degrees of Freedom: 509 Total (i.e. Null); 508 Residual
Null Deviance: 607.1
Residual Deviance: 9.124e-07 AIC: 4
Unfortunately, this reduced model is no more useful than the original maximal model. The only explanatory variable left is the weekly wage a player receives (wage_eur), which is no surprise because the response variable is defined directly from it: it records whether a player's weekly wage is above 8000 euros. Because wage_eur separates the two classes perfectly, the fitting algorithm does not converge and the fitted probabilities are numerically 0 or 1, as the warnings above report. This is also why the residual deviance is essentially zero (9.12e-07 against a null deviance of 607.07) while the coefficient for wage_eur is not significant (z = 0.02, p = 0.984). Under perfect separation the estimates and their standard errors are unreliable, so the near-zero deviance reflects a circular fit rather than a model that explains the data well.
Taken at face value, the slope of 0.0357 is the increase in the log-odds of earning above 8000 euros for each extra euro of weekly wage, which corresponds to an odds ratio of 1.036 per euro. That looks negligible only because the unit is a single euro, and given the separation the figure should not be interpreted anyway. The AIC fell from 30 in the first model to 4 in this one, but with a deviance of effectively zero the AIC is just twice the number of parameters, so the drop comes purely from using fewer explanatory variables. I therefore conclude there is no discernible difference between the original model and this reduced one: neither tells us anything about whether a player makes more than 8000 euros a week beyond what the wage itself already determines.
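When a binary response is derived directly from one of the explanatory variables, that variable splits the two classes perfectly, and glm() typically produces the same warnings, near-zero deviance and very large standard errors seen in the output above. The sketch below reproduces this on simulated data. It does not use the football data: the wage values, the 8000 cut-off applied to them and the names wage, high and fit are all assumptions made for the example.

```r
set.seed(1)

# Simulated weekly wages rounded to the nearest 1000 (hypothetical data)
wage <- round(runif(200, min = 1000, max = 20000), -3)

# Binary indicator derived directly from the wage itself
high <- as.integer(wage > 8000)

# Fitting the indicator on its own source variable; the warnings are
# suppressed here only to keep the output short
fit <- suppressWarnings(glm(high ~ wage, family = binomial))

deviance(fit)                  # residual deviance, effectively zero
summary(fit)$coefficients      # standard errors dwarf the estimates
fit$converged                  # may be FALSE if the iteration limit is hit
```

The coefficient table from such a fit carries no usable information about effect size, which is why the z value is close to zero even though the classification is perfect.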
The best model I could produce for whether a player's weekly wage is above 8000 euros therefore contains a single explanatory variable, and that variable is not significant. There is clearly room for improvement, but it is not obvious how to improve a model for this response. One option would be to disregard the indicator variable entirely and use the weekly wage (wage_eur) as the response variable instead. This would require multiple regression or ANCOVA, and would hopefully give a model in which the explanatory variables explain a higher proportion of the variance in the response.
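A minimal sketch of that alternative is given below. It runs on a simulated stand-in for cleandf, so the data frame sim, the coefficients used to generate it and the log transform of the wage are all assumptions made for illustration, not results from the football data.

```r
set.seed(2)

# Hypothetical stand-in for cleandf with two of its explanatory variables
n <- 300
sim <- data.frame(
  potential = round(rnorm(n, mean = 70, sd = 6)),
  age       = sample(17:38, n, replace = TRUE)
)

# Assumed data-generating process for the wage, purely for illustration
sim$wage_eur <- exp(2 + 0.1 * sim$potential + 0.02 * sim$age +
                    rnorm(n, sd = 0.4))

# Multiple regression with the continuous wage as the response; logging
# the wage is an assumption, made because wages are usually right-skewed
wage.lm <- lm(log(wage_eur) ~ potential + age, data = sim)

summary(wage.lm)$r.squared   # proportion of variance explained
step(wage.lm, trace = 0)     # AIC-based reduction of the model
```

Passing the data frame through the data argument, rather than writing cleandf$ in front of every term, keeps the step() output shorter and allows predict() to be used on new data.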